{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Data Preprocessing\n", "\n", "In this section, we learn how to preprocess data using [CSV.jl](https://csv.juliadata.org/stable/), [DataFrames.jl](https://dataframes.juliadata.org/stable/), [FeatureTransforms.jl](https://invenia.github.io/FeatureTransforms.jl/stable/), and [Statistics](https://docs.julialang.org/en/v1/stdlib/Statistics/#Statistics).\n", "\n", "## Reading the Dataset¶\n", "\n", "Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data. Here, each line corresponds to one record and consists of several (comma-separated) fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,Accomplishments in the field of gravitational physics”. To demonstrate how to load CSV files with CSV.jl, we create a CSV file below ../data/house_tiny.csv. This file represents a dataset of homes, where each row corresponds to a distinct home and the columns correspond to the number of rooms (NumRooms), the roof type (RoofType), and the price (Price)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"../data/house_tiny.csv\"" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using CSV\n", "\n", "csv_data = \"\"\"\n", "NumRooms,RoofType,Price\n", ",,127500\n", "2,,106000\n", "4,Slate,178100\n", ",,140000\n", "\"\"\"\n", "\n", "dir_path = joinpath(\"..\",\"data\")\n", "file_path = joinpath(dir_path,\"house_tiny.csv\")\n", "mkpath(dir_path)\n", "CSV.write(file_path,CSV.File(IOBuffer(csv_data)))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now let’s import `DataFrames.jl` and load the dataset with `CSV.read`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
4×3 DataFrame
RowNumRoomsRoofTypePrice
Int64?String7?Int64
1missingmissing127500
22missing106000
34Slate178100
4missingmissing140000
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& NumRooms & RoofType & Price\\\\\n", "\t\\hline\n", "\t& Int64? & String7? & Int64\\\\\n", "\t\\hline\n", "\t1 & \\emph{missing} & \\emph{missing} & 127500 \\\\\n", "\t2 & 2 & \\emph{missing} & 106000 \\\\\n", "\t3 & 4 & Slate & 178100 \\\\\n", "\t4 & \\emph{missing} & \\emph{missing} & 140000 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m4×3 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m NumRooms \u001b[0m\u001b[1m RoofType \u001b[0m\u001b[1m Price \u001b[0m\n", " │\u001b[90m Int64? \u001b[0m\u001b[90m String7? \u001b[0m\u001b[90m Int64 \u001b[0m\n", "─────┼────────────────────────────\n", " 1 │\u001b[90m missing \u001b[0m\u001b[90m missing \u001b[0m 127500\n", " 2 │ 2 \u001b[90m missing \u001b[0m 106000\n", " 3 │ 4 Slate 178100\n", " 4 │\u001b[90m missing \u001b[0m\u001b[90m missing \u001b[0m 140000" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using DataFrames\n", "data = CSV.read(open(file_path),DataFrame)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preparation\n", "\n", "In supervised learning, we train models to predict a designated target value, given some set of input values. Our first step in processing the dataset is to separate out columns corresponding to input versus target values. We can select columns either by name or via integer-location based indexing.\n", "\n", "You might have noticed that replaced all CSV entries with value with a special `missing` value. This can also happen whenever an entry is empty, e.g., “3,,,270000”. These are called missing values and they are the “bed bugs” of data science, a persistent menace that you will confront throughout your career. Depending upon the context, missing values might be handled either via *imputation* or *deletion*. Imputation replaces missing values with estimates of their values while deletion simply discards either those rows or those columns that contain missing values.\n", "\n", "Here are some common imputation heuristics. For categorical input fields, we can treat `missing` as a category. Since the RoofType column takes values `Slate` and `missing`, `FeatureTransforms.jl` can convert this column into two columns `RoofType_Slate` and `RoofType_missing`. A row whose roof type is `Slate` will set values of `RoofType_Slate` and `RoofType_missing` to `true` and `false`, respectively. The converse holds for a row with a missing `RoofType` value." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
4×3 DataFrame
RowNumRoomsRoofType_missingRoofType_Slate
Int64?BoolBool
1missingtruefalse
22truefalse
34falsetrue
4missingtruefalse
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& NumRooms & RoofType\\_missing & RoofType\\_Slate\\\\\n", "\t\\hline\n", "\t& Int64? & Bool & Bool\\\\\n", "\t\\hline\n", "\t1 & \\emph{missing} & 1 & 0 \\\\\n", "\t2 & 2 & 1 & 0 \\\\\n", "\t3 & 4 & 0 & 1 \\\\\n", "\t4 & \\emph{missing} & 1 & 0 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m4×3 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m NumRooms \u001b[0m\u001b[1m RoofType_missing \u001b[0m\u001b[1m RoofType_Slate \u001b[0m\n", " │\u001b[90m Int64? \u001b[0m\u001b[90m Bool \u001b[0m\u001b[90m Bool \u001b[0m\n", "─────┼────────────────────────────────────────────\n", " 1 │\u001b[90m missing \u001b[0m true false\n", " 2 │ 2 true false\n", " 3 │ 4 false true\n", " 4 │\u001b[90m missing \u001b[0m true false" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using FeatureTransforms\n", "\n", "inputs, targets = data[:,1:2], data[:,3]\n", "\n", "# get all unique roof types\n", "roof_types = unique(inputs.RoofType)\n", "# one-hot encoding column RoofType, and append to original DataFrames\n", "inputs = FeatureTransforms.apply_append(inputs,OneHotEncoding(roof_types);cols=:RoofType,header=\"RoofType_\".*string.(roof_types))\n", "# drop column RoofType\n", "inputs = inputs[:,Not(:RoofType)]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For missing numerical values, one common heuristic is to replace the `missing` entries with the mean value of the corresponding column." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
4×3 DataFrame
RowNumRoomsRoofType_missingRoofType_Slate
Float64BoolBool
13.0truefalse
22.0truefalse
34.0falsetrue
43.0truefalse
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& NumRooms & RoofType\\_missing & RoofType\\_Slate\\\\\n", "\t\\hline\n", "\t& Float64 & Bool & Bool\\\\\n", "\t\\hline\n", "\t1 & 3.0 & 1 & 0 \\\\\n", "\t2 & 2.0 & 1 & 0 \\\\\n", "\t3 & 4.0 & 0 & 1 \\\\\n", "\t4 & 3.0 & 1 & 0 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m4×3 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m NumRooms \u001b[0m\u001b[1m RoofType_missing \u001b[0m\u001b[1m RoofType_Slate \u001b[0m\n", " │\u001b[90m Float64 \u001b[0m\u001b[90m Bool \u001b[0m\u001b[90m Bool \u001b[0m\n", "─────┼────────────────────────────────────────────\n", " 1 │ 3.0 true false\n", " 2 │ 2.0 true false\n", " 3 │ 4.0 false true\n", " 4 │ 3.0 true false" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using Statistics\n", "inputs.NumRooms = replace(inputs.NumRooms, missing=>mean(skipmissing(inputs.NumRooms)))\n", "inputs" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Conversion to the Matrix \n", "We can load `inputs` and `targets` into `Matrix`" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4×3 Matrix{Float64}:\n", " 3.0 1.0 0.0\n", " 2.0 1.0 0.0\n", " 4.0 0.0 1.0\n", " 3.0 1.0 0.0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "4-element Vector{Int64}:\n", " 127500\n", " 106000\n", " 178100\n", " 140000" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "X, y = Matrix(inputs), Vector(targets)\n", "display(X), display(y);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.9.0", "language": "julia", "name": "julia-1.9" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.9.0" } }, "nbformat": 4, "nbformat_minor": 4 }